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Abstract 



A modern graphics processing unit (GPU) is able to perform massively parallel scientific computations at low cost. 
We extend our implementation of the checkerboard algorithm for the two dimensional Ising model [T. Preis et al., 
J. Comp. Phys. 228, 4468 (2009)] in order to overcome the memory limitations of a single GPU which enables us to 
simulate significantly larger systems. Using multi-spin coding techniques, we are able to accelerate simulations on a 
single GPU by factors up to 35 compared to an optimized single Central Processor Unit (CPU) core implementation 
which employs multi-spin coding. By combining the Compute Unified Device Architecture (CUDA) with the Message 
Parsing Interface (MPI) on the CPU level, a single Ising lattice can be updated by a cluster of GPUs in parallel. For 
O I 1 large systems, the computation time scales nearly linearly with the number of GPUs used. As proof of concept we 

reproduce the critical temperature of the 2D Ising model using finite size scaling techniques. 
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_C 1. Introduction 

Various scientific disciplines profited by GPU computing in recent years and are reporting impressive speedup fac- 
tors in comparison to single Central Processor Unit (CPU) core implementations. GPU stands for Graphics Processing 
J> ■ Units which are high-performance many-core processors that can be used to accelerate a wide range of applications. In 

^sO \ the meantime, significant savings of computing time have been reported by a huge variety of fields: GPU acceleration 

can be used in astronomy yj] and radio astronomy yfl. Soft tissue simulation yfl, algorithms for image registration 
|4J], dose calculation |5[], volume reconstruction from x-ray images |6J, and the optimization of intensity-modulated 
radiation therapy plans |7|] are examples for the numerous applications in medicine. Furthermore, DNA seq uence 
alignment [81], molecular dynamics simulations 19|[lO|,[ljJ], quantum chemistry 11211 . multipole calculations 11311 . den- 
sity functional calculations jl4ll5ll . air pollution modeling II 1611 . time series analysis focused on financial markets 
[171LL8J], and Monte Carlo simulations il9ll20[|2Ul22ll benefited from GPU computing. For many applications, the 



accuracy can be comparable to that of a double-precision CPU implementation, such as in 12311 — the latest generation 
of GPUs support not only single precision but also double precision floating point operations. The adaption of many 
computational methods is still in progress, e.g. the analysis of switching processes in financial markets 1124 12511 . Un- 
fortunately, not all algorithms can be ported efficiently onto a GPU architecture. Particularly, serial algorithms are not 
suited for GPU computing (for an example see e.g. |26J]). 

Another crucial limitation is the lack of scalability as current programs typically utilize only single GPUs. As 
graphics processing hardware is targeted at a broad consumer market — the games industry — , graphic cards can be 
produced at low cost. On the other hand, to keep production costs low, the global memory is not upgradable and 



Source code of our implementations for GPU clusters will be published on http : //www . tobiaspreis . de after acceptance. In addition, the 
code can be downloaded from the Google Code project multigpu-ising. 
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typically limited to 1 GB for consumer cards and 4 GB for Tesla GPUs. Using a recent consumer graphics card, 



we accelerated Monte Carlo simulations of the Ising model 12211 . In 12211 . a 2D square spin lattice of dimension up 



to 1024 2 spins could be processed on a consumer GPU. The I sing model as a standard model of statistical physics 



provides a simple microscopic description of ferromagnetism 12711 . It was introduced to explain the ferromagnetic 
phase transition from the paramagnetic phase at high temperatures to the ferromagnetic phase below the Curie tem- 
perature Tq. A large variety of techniques and methods in statistical physics have originally been formulated for the 



Ising model and were generalized and adapted to related models and problems 128H . Due to its simplicity, which can 
be embodied by the possibility to use trivial parallelization approach es JI2 9 | . the two dimensional Ising model is well 
suited as a benchmark model since its properties are well studied 13(il3lLl32ll and many physical systems belong to the 
same universality class. The Isin g m odel on a two dimensional square lattice with no magnetic field was analytically 
solved by Lars Onsager in 1944 0311 . The critical temperature at which a second order phase transition between an 
ordered and a disordered phase occurs can be determined analytically for the two dimensional model (Tc ~ 2.269185 
H). 

Here we show how lattice sizes can be extended up to 100, 000 2 spins on one GPU device with 4 GB of global 
memory using a memory optimized encoding of the spins — one bit per spin. This number of spins turns out to be 
a hard limitation on a single device, since for larger system sizes, spin data would have to be transferred between 
device and host memory. Such a memory transfer would effectively rule out all performance benefits of a GPU 
implementation. Using a multi-spin coding scheme 134 I35J l36l 13711 . computation to memory access ratio can be 



improved, resulting in a dramatically faster GPU performance. 

We show that an extension of this approach can be used successfully to handle Monte Carlo simulations of the 
Ising model in a multi-GPU environment — GPU clusters. The scalability of this implementation is ensured by splitting 
the lattice into quadratic sublattices, and by placing them into the memory of different GPUs. Thus, each GPU can 
perform the calculation of one sublattice in its memory and pass the information about its borders on to its neighboring 
GPUs. Similar approaches have been used, e.g., for the calculation of density functionals 1 1-411 - 

This paper is organized as follows. In a brief overview in Sect. [2] key facts of the GPU architecture are provided 
in order to clarify implementation constraints for the following sections. Sect.[3]provides a survey of model definition 
and finite size scaling techniques used as proof of concept. In Sect.|4]and Sect. [5] we describe details of the reference 
CPU implementation and our single GPU approach based on multi-spin coding. The multi-GPU accelerated Monte 
Carlo simulation of the 2D Ising model is covered in Sect. [6] To overcome the memory limitations of a single GPU 
with such a multi-GPU approach is of crucial importance as GPU clusters are currently set up in supercomputing 
facilities. Our conclusions are summarized in Sect. [7] 

2. GPU Device architecture 

Simulating the Ising model at large system sizes requires a lot of processing performance. Physical and engineer- 
ing obstacles in microprocessor design have resulted in flat performance growth for traditional single-core micropro- 
cessors. On the other hand, graphics hardware has become highly programmable, the fixed function pipelines have 
been replaced by programmable shader units that can perform the same operation on many sets of data in parallel. For 
a comprehensive overview of recent developments in computer graphics, especially programmable shader techniques, 



see 138H . With new, more flexible programming interfaces, these units can be utilized to perform general purpose 
computing in fields other than computer graphics. 

For our GPU implementation, we use the Compute Unified Device Architecture (CUDA) released by NVIDIA for 
their recent graphics accelerator boards. The latest stable release at the time of writing is CUDA 2.3 13911 . Recently, 
other Application Programming Interfaces (APIs) for General Purpose computing on GPUs (GPGPU) became avail- 
able, see e.g. OpenCL 140(1 . Additionally, efforts have been made to establish high-level programming environments 



1 4111 as well as the integration into existing compilers. 

We use a NVIDIA Tesla CI 060 as our CUDA enabled device, which offers 4 GB of GDDR3 global memory, see 
Table Q] This memory can store a multi-spin coded spin field of 100, 000 2 spins on one GPU. The reference CPU 
used in our tests is the Intel Xeon X5560 at a clock rate of 2.80 GHz and 8192 kB cache. The purpose of the CPU 
implementation is to have a fast and fair non-parallel reference implementation, not to benchmark the Intel CPU. 
Therefore, only one core of the CPU is used (without Hyper-Threading Technology). 



Table 1: Key facts and properties of the GPU 142 



Tesla CI 060 


Global video memory 


4096 MB 


Streaming processor cores 


240 


Shared memory per block 


16KB 


Processor clock 


1.30 GHz 


Memory clock 


800 MHz 


Maximal power consumption 


187.8 W 



CUDA implements a Single Instruction Multiple Thread (SIMT) approach. It is capable of running the same code 
in parallel, processed in a "grid". A grid is a number of blocks which in turn contain a defined number of threads. It 
extends the C language by the invocation of "kernels" that run in parallel in such a grid on the GPU: 

cuda_kernel<<<gridDim, blockDim>>>(data) ; 

The variable gridDim defines the number of blocks that run in parallel, and blockDim specifies the number of threads 
that run in each block. Threads in each block share a certain amount of "shared memory" which can be accessed 
roughly one order of magnitude faster than data in the global GPU memory. The Tesla C 1 060 is capable of processing 
a maximum of 512 threads per block. Kernels are executed on the actual hardware in units of "warps", where each 
warp executes one common instruction at a time. 

With a larger number of threads in a block, memory latencies can be hidden more effectively. However, the hiding 
of memory latencies only results in better performance, if the number of registers used by a single thread is sufficiently 
small. To optimize execution time for a kernel, a grid size should be used that allows for a maximum "occupancy". 
The occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the 
GPU. A multiprocessor contains amongst others eight scalar processor cores, a multi-threaded instruction unit, and 
shared memory. This ratio is a helpful number to determine how efficient the kernel will be on the GPU. 

For the multi-GPU implementation, each CPU core runs a separate process and controls one of the available 
GPUs. Communication is established via the Message Passing Interface (MPI). Communication is needed frequently 
(see Sect. [6]) which leads to a bottleneck for small systems, but is ruled out by the benefit of more available GPU cores 
for larger systems. 

3. The two-dimensional Ising model 

The Ising model is formulated on a two-dimensional square lattice, where on each lattice site a spin 5, with a 
value of either -1 or 1 is located. The interaction of the spins is given by the Hamiltonian 

,y<? = -jY^s£j- H Yj Si (1) 

Q,j) i 

where H denotes an external magnetic field, which we will set to zero here. The lattice is updated according to the 



Metropolis criterion 14311 . For each step, the energy difference A^ = ,¥?„ - iffii between two subsequent states a 
and b is calculated. The probability for the step to be accepted is given by W„^b - ex-pi-AJf/lcgT) if AJt? > and 
W a ^b - 1 if A^f < 0. Since only discrete values for this factor are possible, they should be pre-calculated on the 
CPU for each temperature and transferred to the GPU when the kernels are invoked. 

To make efficient use of the GPU device structure, a parallelizable spin-update scheme has to be utilized. The 
ratio between memory latency and processing time on graphics cards is very large 13911 . Thus, GPU cores can perform 
hundreds of instructions in the time of a single access to the global memory. By highly parallel processing, memory 
access latencies can be hidden effectively, and large acceleration factors achieved. 



Parallel spin updates of the Ising model can only be done for non-interacting domains. The approach that each 
spin only interacts with its four nearest neighbors makes a checkerboard update feasible 112211 . The lattice update is 
divided into two update steps A and B. In step A, only the spins residing on a black site are updated since they are not 
interacting with each other. In step B, the spins on white lattice sites are updated. It is essential that update step B is 
started after all updates of step A are finished. Please note, that other methods for the spin updating process are also 



available, e.g. diverse cluster algorithms 11441 I45H . perform particularly well close to the critical point. However, the 
systematic scheme of the checkerboard algorithm is most suitable for the GPU architecture realizing non-interacting 
domains where the Monte Carlo moves are performed in parallel. 

In order to test the correctness of the implementation, we determine the critical temperature of the Ising spin 
system. We use finite size scaling and calculate the Binder cumulant 114a. 13011 



ii m - 1 {M(Tf) m 

with M denoting the magnetization of a configuration at temperature T and (. . .) denoting the thermal average. Near 
a critical point, finite size scaling theory predicts the free energy and derived quantities like the magnetization to be a 
function of linear dimension L over correlation length f ^ (T - TcY v - Therefore, moment ratios of the magnetization 
like, e.g. the Binder cumulant U\, become independent of system size N = « 2 at the critical temperature Tq. To test 
our implementation, we perform several simulations close to the critical point for various linear dimensions n of the 
simulation box and determine t/4. 

4. Optimized reference CPU implementation 

For our optimized CPU reference implementation, we focus on a single spin-flip approach which performs well 
for large lattice sizes. Multi-spin coding refers to all techniques that store and process multiple spins in one unit of 
computer memory. In CPU implementations, update schemes have been developed that allow to process more than 
one spin with a single operation Il34ll35il36ll37ll . We use a scheme which encodes 32 spins into one 32-bit integer in 



a linear fashion. The 32-bit type is chosen since register operations of current hardware perform fastest on this data 
type. The key ingredient for an efficient update algorithm of these 32-bit patterns is to use precomputed bit patterns 
that encode the evaluations of the flip condition expression 

r < exp(-AJt?/k B T) (3) 

for every single spin bit — the variate r is an independent and identically-distributed random number in [0, 1). Since 
there are only two possible energy differences AJ^ 7 with A^f > 0, two Boolean arrays can encode the information of 
an evaluation of the flip condition. For reasonable results, N. Ito 13611 suggested to use a pool of 2 22 to 2 24 Boltzmann 
patterns. 

We call the two Boolean arrays exp4 and exp8. Our encoding is chosen to store a 1 into exp4 if exp(-8J/kBT) < 
r < exp( -AJIksT) and a if not, and a 1 into exp8 if r < exp(-8J/lcBT) and a if not. 

For every spin update, the Monte Carlo simulation will draw a random address in this pool, and return the bit 
patterns at this address. To process a 32-bit spin pattern sg, neighbor patterns s\, S2, si, S4 have to be prepared, that 
contain the neighbors of the /th spin at their /th digit. 

Since the calculation is the same for every bit, it is convenient to look at just one bit. The first step is to transform 
the spin variables into energy variables i„ to get rid of the dependence of the initial state of 50. 

i n = s A s n ,Vn e {1,2,3,4} (4) 

where A denotes an exclusive-or operation. Because of the special encoding of the Boltzmann patterns, the acceptance 
condition for each spin can be expressed in a simple way: 

i\ + ii + 13 + in + 2 • exp^ + exp4 s > 2 (5) 

where exp8, and exp4 s denote the sth Boltzmann patterns that encode the spin flip condition, and s is a random 
position in the pool. It is possible to evaluate this expression for all 32 bits at once by applying a sequence of bitwise 
Boolean operations 0611 . 

4 



Since parallel updates are only allowed on non-interacting domains, an additional bit mask has to be applied to the 
update pattern, that only allows to flip every second spin at once. Details of the implementation can be found in [49j]. 



5. Single-GPU implementation 

5.1. Problems arising from a straightforward porting 

On graphics cards, memory access is very costly compared to operations on registers. The great advantage of 
multi-spin coding is that only one memory access is needed to obtain several spins at once. The CPU implementation 
could be ported to GPU with a kernel that uses less than 16 registers. This allows an optimal occupancy of the GPU 
up to a maximum block size of 512 threads. Even though the update scheme presented in Sect. [4] performs vastly 
faster on the CPU compared to an implementation with integer representations of each spin, a straightforward GPU 
port of this scheme is not optimal. 

The reason for the poor performance is that parallel threads in one warp have to access global memory in a random 
fashion which is very costly. The execution speed can be improved by drawing only one random position per block, 
and let all the threads in this block read the patterns linearly, starting from the drawn starting position. This however, 
reduces the quality of the flip patterns — this could in principle be compensated by using a significantly larger pool of 
random numbers. Another option is to calculate the spin flip patterns on the fly using a random number generator on 
the GPU (see I5.31 l instead of looking them up from the global memory. It turns out however, that the sophisticated 
update scheme has no benefit here, anymore. The performance of these implementations is compared in Table |2] It 
should be emphasized that the quality of random numbers differs between the implementations. In the next section, 
we present another update scheme that works well on the GPU, which prevents pitfalls with the quality of the random 
numbers. 

5.2. Extraction into shared memory 
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Figure 1 : (Color online) The spin lattice is processed by a variable number of blocks (a), where each block runs a variable number of threads (b). 
The threads update the spin lattice in two steps, A and B, using two kernel invocations (c). 

The main goal of the following implementation is to reduce access to the global memory of the GPU, which is 
extremely costly. The best performance without pre-calculating flip patterns can be achieved by extracting the single 
spins into shared memory and performing the calculations on integer registers. The spin field on the graphics card is 
encoded in quadratic blocks of 4 x 4 spins (hereafter referred to as "meta-spins") which can be stored as binary digits 
of one unsigned short integer (2 bytes), which can be accessed by a single memory lookup. Single spin values can be 
extracted from one meta-spin for example using the expression 



s[x,y] = ((meta-spin & (1 « (y * 4 + x) ) ! = 0) * 2 
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Figure 2: (Color online) (a) The way a kernel processes a 4 X 4 meta-spin. (b) Spins are extracted into shared memory and an update pattern is 
created (c). (d) Afterwards, the new spins are obtained using the update pattern (Spins on blue sites will be flipped, spins on white sites will not be 
flipped), and written back to global memory. 



which returns a value of either -1 or 1. Here, "«" denotes a bitwise left-shift operation This is a slightly more 
complicated expression than for a linear layout, but it makes sense for a multi-GPU implementation, where border 
information has to be transferred between various GPUs (see Sect. |5). This approach realizes that each spin uses 
exactly one bit of memory. The spin field is stored in global memory, which is expensive to access. 

To process the spin field on the GPU, the spin field is subdivided into quadratic subfields which can be processed 
by threads grouped into one block (see Fig. QJ. Each thread of this block processes a "meta-spin" of 4 x 4 spins. At 
the beginning of a kernel, it retrieves 5 meta-spins from the global memory, namely its own and its four neighboring 
meta-spins (Fig. |2^). This information is used to extract the information of the 4x4 spins. Each thread will store 
the spin field of 4 x 4 spins as well as the neighboring spins in a 6 x 6 integer array in shared memory, which allows 
for fast computation of the spin flips. The spin update is performed in two steps as described before. A first kernel is 
needed to update the "black sites" on a checkerboard pattern, and a second processes the "white sites". The update 
kernel for the white sites has to wait until all black sites have been updated. Thus, two separate kernels are needed. 
There is no other way to achieve global synchronization between the threads. 

Each kernel creates an update pattern, where each binary digit indicates if the associated spin has to be flipped or 
not. At the end of the kernel execution, the 4x4 meta-spins are updated with one single global memory write. 

In summary, each update thread executes the following steps: 

1 . Look up meta-spins from global memory (See figure 2a) 

2. Extract meta-spins into 6x6 integer array in shared memory which then contains the 4x4 meta-spins and the 
neighbors (See figure 2b) 

3. For all 8 white/black sites Sj in the 4x4 field, draw a random number and evaluate the Metropolis criterion 

4. Generate the update pattern (set the /th bit to 1, if the flip of the /th was accepted) (Figure 2c) 

5. Update the meta-spin by an XOR operation with the update pattern to obtain the spins at the next timestep (Figure 
2d) 

Although the update scheme sounds hardly efficient, it dramatically reduces global memory access compared to 
the previous implementation, which results in faster computation times on GPU hardware. 

6 



Table 2: Comparison at lattice size 4096x4096: CPU simple encodes one spin in one integer, CPU multi-spin coding uses the efficient "multi-spin" 
update scheme presented in section[4] multi-spin unmodified is a straightforward porting of this update to the GPU, multi-spin coding on the fly uses 
the same scheme but calculates the update patterns at each update step on the fly, and multi-spin coding linear determines one starting position in 
the random number pattern in the pool per block, and lets the threads read the random numbers linearly from that position on. The shared memory 
implementation (see section 15^21 provides best quality random numbers. A preliminary test run on a Fermi GPU shows a factor of 1.82 compared 
toaTeslaC1060. 





Spinflips per ps, 


Relative speed 


CPU simple 


26.6 


0.11 


CPU multi-spin coding 


226.7 


1.00 


shared memory 


4415.8 


19.50 


shared memory (Fermi) 


8038.2 


35.46 


multi-spin unmodified 


3307.2 


14.60 


multi-spin coding on the fly 


5175.8 


22.80 


multi-spin coding linear 


7977.4 


35.20 



After the update is completed, the magnetization per spin m(T) has to be extracted from the lattice. In a first step, 
the magnetization of each block can be summed using the shared memory of each block by employing a binary tree 
reduction and writing out the total magnetization of the slice back to the main memory. The final summation of the 
magnetizations of the blocks can be done either on the CPU or on the GPU at about equal speeds. 

5.3. Random number generation 

For every update thread a random number is needed, either to decide if the spin is flipped or not, or to look up an 
update pattern in global memory. This is why an efficient method to create random numbers is needed. 

In our implementation, we use an array of linear congruential random number generator (LCRNG) which is one 



of the oldest and most studied algorithm to generate pseudo random numbers 04711 . A single random number generator 
provides the random numbers for every update thread j. A sequence of random numbers for the y'th thread Xtj (where 
i e N) is generated by the recurrence relation 

x i+i.j — ( a ' x i,j + c)mod m (6) 

where a, c and m are integer coefficients. An appropriate choice of these coefficients is responsible for the quality 



of the produced random numbers. We use a - 1664525 and c = 1013904223 as suggested, e.g., in [48]. Since by 
construction, results on a 32-bit register are truncated to the endmost 32 bits, the modulo operation m is set to 2 32 . By 
normalizing (yij — abs(x,-, / /2 31 )) the LCRNG can be used to generate random numbers y*j in the interval [0; 1). For 
the GPU, an array of random numbers that provides a single random number seed for every spin update thread can be 
generated by the iteration 

xoj+i = (16807 • xqj) mod m (7) 

with xo,o = 1 ■ 

5.4. Performance comparison 

The multi-spin implementations are compared to simple implementations on both CPU and GPU. As a measure- 
ment for the performance of an implementation, we use the number of single spin flips per second, which also allows 
to compare results for different lattice sizes. The temperature is set to 0.99 Tq- 

The GPUs perform most efficient for lattice sizes of a linear dimension beyond 4096 x 4096. For this lattice size, 
a GPU is faster by a factor of about 15-35, depending on the implementation and the resulting quality of random 



numbers. For the simple implementation used in [22], between 1024 x 1024 and 2048 x 2048 spins the spin field size 
becomes comparable to the CPU L3 cache size, which leads to a higher rate of costly L3 cache misses. This is the 
point at which the simple implementation becomes inefficient. 
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Figure 3: (Color online) Benchmarking the implementations: The system is simulated at a constant temperature of T = 0.99 Tc- The performance 
of the straight port varies strongly with lattice size because of the large block size of 512 threads, while the shared memory implementation offers 
stable performance over a wide range of sizes and offers better quality random numbers (comparable to the simple CPU implementation). The 
dotted line shows a preliminary benchmark of a Fermi GPU which became available in April 2010. A GeForce GTX 480 provides the following 
features: 1536 MB global memory, 480 streaming processor cores, 1.40 GHz processor clock, 1848 MHz memory clock, and a maximal power 
consumption of 250 W. 



6. Multi-GPU approach 

6.1. Implementation 

The general idea is to extend the quadratic lattice by putting multiple quadratic "meta-spin" lattices next to each 
other in a super-lattice (see Fig. [4^ for a 2x2 super-lattice) and let each lattice be handled by one of the installed 
GPUs. On the border of each lattice, at least one of the neighboring sites is located in the memory of another GPU 
(see Fig. [4j}). For this reason, the spins at the borders of each lattice have to be transferred from one GPU to the GPU 
handling the adjacent lattice. This can be realized by introducing four neighbor arrays containing the spins of the 
lattices' own borders, and four arrays for storing the spins of its adjacent neighbors (Fig.HJ:). 

At the beginning of the execution, each MPI process initializes its own spin lattice, writes out its border spins into 
its own border arrays and sends them to its neighbors. In return it receives the adjacent borders from the according 
MPI processes. After this initialization phase, spins and random seeds are transferred to the GPU. 

Then, a single lattice update has to be performed in the following way: 
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Figure 4: (Color online) (a) Each GPU processes a "meta-spin" lattice of size N 



The lattices are aligned on a super-lattice, and the outer 



borders are connected via periodic boundary conditions. In this example, 4 GPUs process a system of 2 2 ■ N spins, (b) A meta-spin update needs the 
4 nearest neighbor meta-spins. On the borders of a lattice, each GPU needs the spin information of the neighboring lattices. The border information 
has to be passed between the GPUs. In our implementation this is done by using 8 neighbor arrays. 



1 . Copy neighbor borders to GPU memory 

2. Call kernel to perform update (A) 

3. Call kernel to extract borders from the spin array 
to own borders array 

4. Copy own borders to host memory 

5. Exchange borders with the other MPI processes 



6. Copy neighbor borders to GPU memory again 

7. Call kernel to perform update (B) 

8. Call kernel to extract borders from spin array again 

9. Transfer own borders to host memory 

10. Exchange borders with other MPI processes 

1 1 . Retrieve processed data from GPU 



It turns out that the transfer time was not the limiting factor for our purposes but rather the latency of the memory 
accessed. 

6.2. GPU cluster performance 

For performance measurements on the GPU cluster, the shared memory implementation (see Sect. 15.2b was used, 
since it provided stable performance for various lattice sizes and because the memory layout is symmetric in the x and 
y direction, resulting in symmetric communication data. The tests were run on a GPU cluster with two Tesla CI 060 
GPUs in each node. Communication is established via Double Data Rate InfiniBand. The performance for various 
system sizes (see Fig- EJ provides evidence that for more than one GPU, spin flip performance scales nearly linearly 
with the amount of GPUs. The drop from one GPU to four GPUs is due to the communication overhead produced by 
exchanging borders. For larger system sizes, the communication overhead per CPU/GPU remains constant. An opti- 
mal performance is reached for lattice sizes beyond 4096 x 4096 per GPU. For 64 GPUs — the NEC Nehalem Cluster 
maintained by the High Performance Computing Center Stuttgart (HLRS) provides 128 GPUs — , a performance of 
206 spinflips per nanosecond can be achieved on a 800. 000 2 2D Ising lattice, i.e. we can update the whole lattice in 
about three seconds. 



7. Conclusion 

We presented two major improvements over our previous work. By using multi-spin coding techniques, we 
improved the computation to memory access ratio of our calculations dramatically, resulting in better overall per- 
formance. On a single GPU, up to 7.9 spinflips per nanosecond are possible, 15 to 35 times faster than our highly 
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Figure 5: (Color online) Cluster performance for various system sizes (per GPU). For more than one GPU, spin flip performance scales nearly 
linearly with the amount of GPUs. Again, optimal performance is reached at a lattice size of about 4096 X 4096 per GPU. Using 64 GPUs, a 
performance of 206 spinflips per nanosecond can be achieved on a 800.000 X 800.000 lattice. 

optimized CPU version , depending on the implementation and the quality of random numbers. The other improve- 
ment targets the utilization of GPU clusters, where the 2D Ising lattice is distributed over many GPUs. We show that 
our implementation scales nearly linearly with the number of GPUs, which allows us to process huge Ising lattices 
on GPU clusters. Preliminary tests on a NVIDIA GPU of the latest generation — the Fermi architecture which offers 
twice the amount of streaming processor cores — indicate an additional speedup of roughly 1.8 compared to a Tesla 
CI 060. 
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